DOCUMENT RESUME 



ED 351 383 



TM 019 215 



AUTHOR Nandakumar, Ratna; Stout, William
TITLE Refinements of Stout's Procedure for Assessing Latent Trait Unidimensionality.
INSTITUTION Illinois Univ., Urbana. Dept. of Statistics.
SPONS AGENCY Office of Naval Research, Arlington, Va.
REPORT NO 1992-1; ONR-4421-548
PUB DATE 1 Aug 92
CONTRACT N00014-90-J-1940
NOTE 52p.; Paper to be published in the "Journal of Educational Statistics."
PUB TYPE Reports - Evaluative/Feasibility (142)
EDRS PRICE MF01/PC03 Plus Postage.
DESCRIPTORS Computer Simulation; *Computer Software; Equations (Mathematics); Estimation (Mathematics); Guessing (Tests); Hypothesis Testing; *Item Response Theory; *Mathematical Models; *Psychological Testing; Statistical Significance; *Test Items
IDENTIFIERS American College Testing Program; Armed Services Vocational Aptitude Battery; Binary Response; DIMTEST (Computer Program); *Stouts Procedure; *Unidimensionality (Tests)



ABSTRACT 

A detailed investigation of the statistical procedure 
of W. Stout (the computer program DIMTEST) for testing the hypothesis 
that an essentially unidimensional latent trait model fits observed 
binary response data from a psychological test is presented. One 
finding is that DIMTEST may fail to perform as desired in the 
presence of guessing when coupled with many high-discriminating test 
items. A revision of DIMTEST is proposed to overcome this limitation. 
Also, an automatic approach is devised to determine the size of the 
assessment subtests. Further, an adjustment is made on the estimated 
standard error of the statistic on which DIMTEST depends. These three 
refinements result in an improved procedure that is shown in 
simulation studies to adhere closely to the nominal level of 
significance while achieving considerably greater power. Finally, 
DIMTEST is validated on real data sets from the Armed Services 
Vocational Aptitude Battery (1,984 and 1,961 examinees) and the 
American College Testing program (2,491 and 2,494 examinees). Seven 
tables present analysis results, and 46 references are included.
(Author/SLD) 



Reproductions supplied by EDRS are the best that can be made
from the original document.



U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement

EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not necessarily represent official OERI position or policy.



Refinements of Stout's Procedure for Assessing 
Latent Trait Unidimensionality 



Ratna Nandakumar¹
Department of Educational Studies 
University of Delaware 

William Stout 
Department of Statistics 
University of Illinois 

August 1, 1992 



Prepared for the Cognitive Science Research Program, Cognitive and Neural Sciences
Division, Office of Naval Research, under grant number N00014-90-J-1940, 4421-548.
Approved for public release, distribution unlimited. Reproduction in whole or in part is
permitted for any purpose of the United States Government.



¹ The authors would like to convey special thanks to Brian Junker for his time and
insightful comments on earlier drafts of this manuscript; to Mark Reckase and Robert Linn
for providing us with the real data sets; and to Roderick McDonald for allowing us to use
the nonlinear factor analysis program NOFA.






REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0188

1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: August 1992
3. REPORT TYPE AND DATES COVERED: Technical
4. TITLE AND SUBTITLE: Refinements of Stout's Procedure for Assessing Latent Trait Unidimensionality
5. FUNDING NUMBERS: N00014-90-J-1940
6. AUTHOR(S): Ratna Nandakumar and William Stout
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Department of Statistics, University of Illinois, 725 South Wright Street, Champaign, IL 61820
8. PERFORMING ORGANIZATION REPORT NUMBER: 1992 - No. 1
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Cognitive Sciences Program, Office of Naval Research, 800 North Quincy, Arlington, VA 22217-5000
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: NR 4421-548
11. SUPPLEMENTARY NOTES: To be published in Journal of Educational Statistics. Software to carry out procedures available from authors.
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution unlimited
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words): See reverse
14. SUBJECT TERMS: See reverse
15. NUMBER OF PAGES: 47
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: unclassified
20. LIMITATION OF ABSTRACT

NSN 7540-01-280-5500
Standard Form 298 (Rev 2-89)
Prescribed by ANSI Std Z39-18
298-102



Refinements of Stout's Procedure for Assessing 
Latent Trait Unidimensionality



Abstract 

This paper provides a detailed investigation of Stout's statistical procedure (the 
computer program DIMTEST) for testing the hypothesis that an essentially 
unidimensional latent trait model fits observed binary item response data from a 
psychological test. One finding was that DIMTEST may fail to perform as desired in the 
presence of guessing when coupled with many high-discriminating items. A revision of
DIMTEST is proposed to overcome this limitation. Also, an automatic approach is devised
to determine the size of the assessment subtests. Further, an adjustment is made on the
estimated standard error of the statistic on which DIMTEST depends. These three 
refinements have led to an improved procedure that is shown in simulation studies to 
adhere closely to the nominal level of significance while achieving considerably greater 
power. Finally, DIMTEST is validated on a selection of real data sets. 

Subject terms: Unidimensionality, essential independence, essential unidimensionality,
DIMTEST, item response theory.






Stout's procedure for unidimensionality — 2 



Refinements of Stout's Procedure for Assessing
Latent Trait Unidimensionality 



Item response theory (IRT) is presently one of the most widely used techniques in 
psychometrics and is likely to remain so in the future. Some applications of IRT include
ability estimation, item/test bias, equating, and adaptive testing. The three assumptions
underlying many commonly used IRT models are monotonicity, unidimensionality (d=1),
and local independence (LI). Monotonicity assumes that the probability of correctly
answering an item increases as ability increases. Unidimensionality² assumes that the items
of a test measure a single ability. Local independence assumes that given any particular
level of ability, responses to different items are independent. This paper is concerned with
the statistical assessment of the assumption of unidimensionality. Most IRT models
specifically require this assumption; moreover, classical test theory models implicitly
assume that items measure the same dominant dimension. In spite of the importance of
this assumption, it is also well known that actual data are rarely strictly unidimensional. It 
has long been argued that items are multiply determined and that, in addition to 
measuring the intended attribute, other attributes unique to individual items or common to 
relatively few items are unavoidable (Humphreys, 1981, 1985, 1986; Hambleton & 
Swaminathan, 1985; Reckase, 1979, 1985; Stout, 1987; Traub, 1983; Yen, 1985). In addition 
to the multiple item attributes that influence dimensionality, examinee characteristics such 
as differential teaching methods, the point of time during the instructional unit that the 
test is given, and so forth, also can influence the dimensionality of a set of items 
(Birenbaum & Tatsuoka, 1982; Bejar, 1983; Traub, 1983). Dimensionality is therefore a 
property of both the test and the examinee population taking the test (Reckase, 1990). 

Linear factor analysis (subjectively interpreted in the absence of a statistical 
distribution theory) has been the traditional approach for assessing the dimensionality of a
set of items. If the results of a linear factor analysis reveal only one significant factor, then
the test is considered unidimensional. In the case of dichotomous data, however, it is well 
known that linear factor analysis of phi correlations between items often leads to
overestimation of the number of factors underlying the item responses (Carroll, 1945;
Hambleton & Swaminathan, 1985, Chapter 2; Hulin, Drasgow, & Parsons, 1983, Chapter 8;
McDonald & Ahlawat, 1974). As a corrective alternative, tetrachoric correlations can be
used for factor analysis. When guessing is present in the responses to items, however, linear
factor analysis of tetrachoric correlations can produce a spurious factor due to difficulty of
test items (Hulin, Drasgow, & Parsons, 1983, Chapter 8). In addition, computation of
tetrachoric correlations can be problematic if any one of the correct/incorrect cells of the
two-by-two item response tables contains a zero. Matrices of simple tetrachoric
correlations are thus often non-Gramian. As a result, conventional methods of factor
analysis by phi or tetrachoric correlations are often unsatisfactory for assessing the 
dimensionality of test items. Christoffersson (1975) and Muthen (1978) have developed 
generalized least squares methods to overcome the problems with factor analysis of 
tetrachoric correlations, but their methods are limited to 25 items at most. Moreover, they 
are computationally intensive. 
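
The difficulty effect on phi correlations can be illustrated with a small numerical sketch. The function name `phi` and the counts below are hypothetical and are not taken from any of the cited studies. In both tables every examinee who passes the harder item also passes the easier one, yet phi reaches 1 only when the two items are equally difficult. The second table also contains a zero cell, the situation in which tetrachoric estimation becomes problematic.

```python
import math

def phi(n11, n10, n01, n00):
    """Phi correlation for a 2x2 table of counts.

    n11: both items correct, n10: only item 1 correct,
    n01: only item 2 correct, n00: both items incorrect.
    """
    r1, r0 = n11 + n10, n01 + n00   # item 1 correct / incorrect
    c1, c0 = n11 + n01, n10 + n00   # item 2 correct / incorrect
    denom = math.sqrt(r1 * r0 * c1 * c0)
    if denom == 0:
        raise ValueError("an item has zero variance")
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical counts for 100 examinees.
same_difficulty = phi(50, 0, 0, 50)       # both items passed by 50%
different_difficulty = phi(20, 60, 0, 20) # item 1 passed by 80%, item 2 by 20%
print(same_difficulty, different_difficulty)
```

Because the ceiling of phi depends on how far apart the item difficulties are, a factor analysis of phi correlations can group items by difficulty rather than by content.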

In recent years a vast body of literature has been developed for assessing the 
dimensionality of test items. Comprehensive reviews of different procedures for assessing 
dimensionality are provided by Hattie (1984, 1985), and Hulin, Drasgow, and Parsons 
(1983, Chapter 8). Some of the more recent procedures developed to assess latent trait 
dimensionality include: maximum likelihood full information factor analysis (Bock, 
Gibbons, & Muraki, 1985); the Tucker and Humphreys procedures based on local 
independence and first and second factor loadings (Roznowski, Tucker, & Humphreys, 
1991); Stout's (1987) procedure for assessing unidimensionality based on the theory of 
essential independence; modified parallel analysis, which combines latent trait methods and
factor analysis and uses eigenvalues of the tetrachoric correlation matrix (Hulin, Drasgow, &
Parsons, 1983, p. 255); McDonald's nonlinear factor analysis (McDonald, 1962; McDonald
& Ahlawat, 1974; Etezadi-Amoli & McDonald, 1983); Holland and Rosenbaum's test of
unidimensionality, monotonicity, and conditional independence (Holland, 1981; Holland &
Rosenbaum, 1986); residual analysis, determined by model-data fit (Hambleton &
Swaminathan, 1985, Chapter 8); and Bejar's procedure based on three-parameter logistic
item parameter estimates (Bejar, 1980). Some of these methods are reviewed in Hambleton
and Rovinelli (1986), Mislevy (1986), and Zwick (1987a).

Although these different approaches offer promise for assessing the dimensionality of 
binary data, researchers in the field have not reached a consensus on one satisfactory 
method (Berger & Knol, 1990; Hambleton & Rovinelli, 1986; Hattie, 1984; Zwick, 1987b). 
Primarily, this is due to the fact that there is substantial confusion in the literature 
concerning the definition of unidimensionality. Additionally, many existing methods for 
assessing dimensionality are only loosely connected to the various definitions in the 
literature (Hambleton & Rovinelli, 1986). 

This article is concerned with Stout's procedure for assessing unidimensionality 
(DIMTEST). Stout (1987) has developed a nonparametric statistical procedure based on 
the large sample distribution theory for assessing latent trait dimensionality and has 
argued the validity of this procedure based upon simulation studies involving a variety of 
achievement tests. DIMTEST has been shown to discriminate well between one- and
two-dimensional tests, maintaining good adherence to a specified level of significance when
d=1 and maintaining good power when d=2, even when the correlation between the
abilities is as high as .7.

The present study provides a detailed investigation of certain performance
characteristics and the consequent major refinements of DIMTEST for assessing latent
trait dimensionality. DIMTEST was found to perform undesirably in certain cases where
the test contained many highly discriminating items with guessing present. A correction is
proposed to overcome this limitation. In addition, an automatic approach is devised for
determining M, the size of the assessment subtests; a better control of α, the specified level
of significance, is achieved by adjusting the estimated standard error of Stout's statistic T.
These refinements have led to an improved test procedure that is easier to use and has been
shown in simulation studies to adhere closely to the nominal level of significance while
achieving considerably greater power. Finally, the procedure is applied to a selection of real
data sets.

Stout's Procedure for Assessing Unidimensionality 

As stated in the beginning of this paper, items are multiply determined, and thus 
the number of dominant abilities should be assessed in testing for dimensionality. Stout 
first informally (1987) and then formally (1990) provided a definition of the number of 
dominant dimensions known as essential dimensionality, which is derived from the theory 
of essential independence. The statistical procedure for assessing essential unidimensionality
is consistent with the definition of essential dimensionality. To assist the reader in
evaluating this claim and in understanding the refinements made
to DIMTEST, Stout's definition of essential dimensionality will be followed by a brief
summary of the statistical procedure. The reader is advised, however, that use of
DIMTEST does not require acceptance of Stout's notion of essential dimensionality, and,
in fact, DIMTEST can also be viewed as a technique to detect sizable lack of fit of a locally
independent unidimensional latent trait model.

Let $U_i$ denote the i-th item response and $\mathbf{U}_N = (U_1, U_2, \ldots, U_N)$ denote the test
response vector for an N-item test. Observed item and test values will be denoted by $u_i$
and $\mathbf{u}_N = (u_1, u_2, \ldots, u_N)$, respectively. Let $U_i = 1$ denote a correct response and $U_i = 0$
denote an incorrect response to item i for a randomly chosen examinee. The latent random
vector is denoted by $\Theta$ and the particular values it takes are denoted by $\theta$. Let $P_i(\theta)$ denote
the probability that a randomly chosen examinee with ability $\theta$ will get the i-th item
correct. It is assumed that all item response functions $P_i(\theta)$ are monotone. Let
$\mathbf{U} = \{U_i, i \ge 1\}$ denote the item pool consisting of $\mathbf{U}_N$ as its first N items. The item pool is
conceptualized as a result of continuing the test construction process in the same manner
beyond the construction of the N items that make up the actual test $\mathbf{U}_N$ being studied.
One advantage of using the only partially observed $\mathbf{U}$ instead of the actually observed $\mathbf{U}_N$
to model the test is that a totally rigorous definition of the number of dominant dimensions
can be given. These ideas are carefully and formally developed in Stout (1990) and
constitute a large sample approach to test modeling.

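Definition 1 (Stout, 1990). The item pool $\mathbf{U}$ is said to be essentially independent (EI)
with respect to the latent variable $\Theta$ if $\mathbf{U}$ satisfies

$$\frac{1}{\binom{N}{2}} \sum_{1 \le i < j \le N} \left| \mathrm{Cov}(U_i, U_j \mid \Theta = \theta) \right| \to 0 \quad \text{as } N \to \infty$$

for every $\theta$.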

The distinction between local independence and essential independence is that local
independence requires $\mathrm{Cov}(U_i, U_j \mid \Theta = \theta) = 0$ for all $\theta$; whereas, essential independence
requires the average value of $|\mathrm{Cov}(U_i, U_j \mid \Theta = \theta)|$ over all item pairs to be small in
magnitude for all $\theta$ as the test length increases. Hence, essential independence is a weaker
assumption than local independence.
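
A toy calculation may help show why the averaged condition is weaker. The design below is a hypothetical illustration, not a model from Stout (1990): items are assumed to come in disjoint pairs that share a small conditional covariance at a fixed ability level, while all other pairs of items are conditionally uncorrelated. Local independence fails for any test length, yet the average absolute conditional covariance over all item pairs shrinks toward zero as the test grows.

```python
from math import comb

def mean_abs_cond_cov(n_items, pair_cov):
    """Average |Cov(U_i, U_j | theta)| over all item pairs.

    Assumed toy design: items form n_items / 2 disjoint pairs, each with
    conditional covariance `pair_cov`; every other pair has covariance 0.
    """
    if n_items % 2:
        raise ValueError("n_items must be even")
    dependent_pairs = n_items // 2
    return dependent_pairs * abs(pair_cov) / comb(n_items, 2)

for n in (10, 100, 1000):
    print(n, mean_abs_cond_cov(n, 0.05))
```

In this design the average equals `pair_cov / (n_items - 1)`, so the nuisance dependence becomes negligible in the aggregate even though it never disappears within a pair.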



Definition 2 (Stout, 1990). The essential dimensionality ($d_E$) of an item pool $\mathbf{U}$ is
the minimal dimensionality (number of elements in $\Theta$) necessary to satisfy the assumption
of essential independence. When $d_E = 1$, essential unidimensionality is said to hold.

The reader should note that $d_E = 1$ means that $\mathbf{U}$ has an IRT model for which
essential independence holds for a unidimensional latent trait $\Theta$. The ordinary definition of
IRT dimensionality is the same as Definition 2, but essential independence is replaced by
local independence and $\mathbf{U}$ is replaced by $\mathbf{U}_N$. Stout argues that the assumptions concerning
local independence and the resulting ordinary IRT definition of dimensionality should often
be replaced by the respective weaker assumptions concerning essential independence and
essential dimensionality. Junker (1988, 1991) has proved results concerning essential
independence and, in particular, has derived statistical consistency results for maximum
likelihood estimates of ability under the assumption of essential unidimensionality.

It can now be clearly stated what assessing the hypothesis of essential
unidimensionality means: among all the essentially independent monotone IRT models for
$\mathbf{U}$, does there exist a unidimensional one? To answer this question, we assume both
monotonicity and essential independence and assess the lack of fit of unidimensionality.
This approach is similar to most other procedures for assessing data dimensionality, with
the exception that essential rather than local independence is assumed.

The statistical procedure for testing the null hypothesis of essential
unidimensionality will be briefly described here. For further details see Stout (1987, Sec. 4).
The N test items are split into two assessment subtests of length M each, called the
Assessment 1 subtest (AT1) and the Assessment 2 subtest (AT2), and a longer subtest
called the partitioning subtest (PT) of length n (= N − 2M). The M items for subtest AT1
are selected to have the same dominant trait. This splitting can be done using either expert
opinion or exploratory factor analysis. Whatever method is used to select items of AT1, the
goal is to select a small subset of items (up to one-fourth of the total test length seems a
good convention) that all measure the same dominant trait and, at the same time, are as
dimensionally different as possible from the PT items. Once items for AT1 are selected, a
second set of M items is selected for AT2 from the remaining items so that AT2 items have
a difficulty distribution similar to AT1 items (Step 6, Stout, 1987). The remaining n (=
N − 2M) items then become the partitioning subtest PT.
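
The three-way split can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_items`, the use of proportion correct as the difficulty measure, and the greedy nearest-difficulty matching rule are all hypothetical simplifications, and Step 6 of Stout (1987) remains the authoritative description of how AT2 is matched to AT1.

```python
def split_items(p_values, at1):
    """Return (at1, at2, pt) as lists of item indices.

    p_values: proportion correct for each item (used here as a difficulty proxy).
    at1: indices already chosen for AT1, e.g. by expert opinion or factor analysis.
    AT2 is built by pairing each AT1 item with the unused item whose
    proportion correct is closest (an assumed, simplified matching rule).
    """
    chosen = set(at1)
    remaining = [i for i in range(len(p_values)) if i not in chosen]
    if len(remaining) < len(at1):
        raise ValueError("not enough items left to build AT2")
    at2 = []
    for i in at1:
        match = min(remaining, key=lambda j: abs(p_values[j] - p_values[i]))
        at2.append(match)
        remaining.remove(match)
    return list(at1), at2, remaining   # the leftover items form PT

# Hypothetical proportions correct for an 8-item test.
p = [0.90, 0.85, 0.30, 0.35, 0.60, 0.88, 0.50, 0.32]
print(split_items(p, at1=[0, 2]))
```

With these made-up values, the easy AT1 item 0 is matched with item 5 and the hard AT1 item 2 with item 7, so AT2 mirrors the difficulty spread of AT1 and the four remaining items make up PT.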

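Each examinee is assigned to one of K subgroups according to his/her score on PT.
After eliminating subgroups with too few examinees ($J_{\min} = 20$ recommended), within each
subgroup k, two variance estimates, the usual variance estimate $\hat\sigma_k^2$ and the
"unidimensional" variance estimate $\hat\sigma_{U,k}^2$, are computed using items of AT1:

$$\hat\sigma_k^2 = \frac{1}{J_k} \sum_{j=1}^{J_k} \left( Y_j^{(k)} - \bar{Y}^{(k)} \right)^2,$$

with $U_{ijk}$ denoting the response of the jth examinee to the ith item from the kth subgroup,
where

$$Y_j^{(k)} = \frac{1}{M} \sum_{i=1}^{M} U_{ijk}, \qquad \bar{Y}^{(k)} = \frac{1}{J_k} \sum_{j=1}^{J_k} Y_j^{(k)},$$

and $J_k$ denoting the number of examinees in the kth subgroup; and

$$\hat\sigma_{U,k}^2 = \frac{1}{M^2} \sum_{i=1}^{M} \hat{p}_i^{(k)} \left( 1 - \hat{p}_i^{(k)} \right),$$

where

$$\hat{p}_i^{(k)} = \frac{1}{J_k} \sum_{j=1}^{J_k} U_{ijk}.$$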
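The difference in these variance estimates is then normalized by an appropriate
normalizing constant $S_k$ and summed over subgroups to arrive at the statistic

$$T_L = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \left[ \frac{\hat\sigma_k^2 - \hat\sigma_{U,k}^2}{S_k} \right], \qquad (2)$$

with

$$S_k^2 = \frac{\left( \hat\mu_{4,k} - \hat\sigma_k^4 \right) + \hat\delta_{4,k} / M^4}{J_k}, \qquad (3)$$

where

$$\hat\mu_{4,k} = \frac{1}{J_k} \sum_{j=1}^{J_k} \left( Y_j^{(k)} - \bar{Y}^{(k)} \right)^4$$

and

$$\hat\delta_{4,k} = \sum_{i=1}^{M} \hat{p}_i^{(k)} \left( 1 - \hat{p}_i^{(k)} \right) \left( 1 - 2\hat{p}_i^{(k)} \right)^2.$$

Here $Y_j^{(k)}$ is the proportion-correct score of the jth examinee of subgroup k on the
assessment subtest, $\bar{Y}^{(k)}$ is the subgroup mean of these scores, and $\hat{p}_i^{(k)}$ is the
proportion of subgroup k answering item i correctly.

Similarly, using items of AT2, the two variance estimates $\hat\sigma_k^2$, $\hat\sigma_{U,k}^2$ and the
standard error of estimate $S_k$ are computed and normalized within each subgroup to arrive
at the statistic $T_B$ using formula (2). The statistic T to assess departure from essential
unidimensionality is given by

$$T = (T_L - T_B)/\sqrt{2}. \qquad (4)$$

The null hypothesis of $d_E = 1$ is rejected if $T > Z_\alpha$, where $Z_\alpha$ is the upper 100(1−α)
percentile of the standard normal distribution and α is the desired level of significance.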
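
The computation of T can be sketched in a few lines. This is a minimal illustration, not the DIMTEST program: the function names are hypothetical, and the within-subgroup formulas (variance of proportion-correct scores, the sum of p(1 − p) over items divided by M squared, and a standard error built from fourth moments) are written as assumptions following the general form described in Stout (1987). Degenerate subgroups with zero estimated standard error are simply skipped, which is also an assumption.

```python
import math
from statistics import NormalDist

def subtest_statistic(responses, items, groups):
    """Normalized sum over subgroups of (usual - unidimensional) variance.

    responses: list of 0/1 lists, one row per examinee.
    items: indices of the assessment subtest (AT1 or AT2).
    groups: lists of examinee indices, one list per PT-score subgroup.
    """
    m = len(items)
    terms = []
    for members in groups:
        jk = len(members)
        y = [sum(responses[j][i] for i in items) / m for j in members]
        ybar = sum(y) / jk
        var_k = sum((v - ybar) ** 2 for v in y) / jk          # usual variance
        p = [sum(responses[j][i] for j in members) / jk for i in items]
        var_u = sum(q * (1 - q) for q in p) / m ** 2          # "unidimensional" variance
        mu4 = sum((v - ybar) ** 4 for v in y) / jk
        delta4 = sum(q * (1 - q) * (1 - 2 * q) ** 2 for q in p)
        s2 = ((mu4 - var_k ** 2) + delta4 / m ** 4) / jk      # assumed form of S_k^2
        if s2 > 0:                                            # skip degenerate subgroups
            terms.append((var_k - var_u) / math.sqrt(s2))
    if not terms:
        raise ValueError("no usable subgroups")
    return sum(terms) / math.sqrt(len(terms))

def dimtest_T(responses, at1, at2, pt, min_size=20):
    """T = (T_L - T_B) / sqrt(2), with subgroups formed from PT number-correct scores."""
    by_score = {}
    for j, row in enumerate(responses):
        by_score.setdefault(sum(row[i] for i in pt), []).append(j)
    groups = [g for g in by_score.values() if len(g) >= min_size]
    t_l = subtest_statistic(responses, at1, groups)
    t_b = subtest_statistic(responses, at2, groups)
    return (t_l - t_b) / math.sqrt(2)

def reject(t, alpha=0.05):
    """One-sided test against the upper-alpha standard normal critical value."""
    return t > NormalDist().inv_cdf(1 - alpha)
```

A call such as `reject(dimtest_T(data, at1, at2, pt))` would then give the decision at the .05 level for a 0/1 response matrix `data` and three disjoint lists of item indices.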

Correction for Bias in the Statistic $T_L$ by Introduction of $T_B$



Consider the statistical bias that would result if $T_L$ rather than T were the statistic
used to assess essential dimensionality. The above description shows that Stout's test is
based on two variance estimates: the usual variance estimate $\hat\sigma_k^2$, and the unidimensional
variance estimate $\hat\sigma_{U,k}^2$. If the items of the test measure one dominant trait, then the two
subtests AT1 and PT would contain essentially unidimensional items representing the same
dominant trait. When the test is both long and essentially unidimensional,
examinees within each subgroup can be assumed to be of approximately equal ability.
Consequently, it can be shown that the differences in the variance estimates $\hat\sigma_k^2 - \hat\sigma_{U,k}^2$,
computed using items in AT1, would be "small"; thus, using $T_L$, the test will be assessed
as essentially unidimensional. By contrast, if the test is long and essentially
multidimensional, the trait measured by items of AT1 would be different from the trait(s)
measured by the rest of the test, and the AT1 differences $\hat\sigma_k^2 - \hat\sigma_{U,k}^2$ would not be small (see
Stout (1987) for the heuristics explaining why this holds), and $T_L$ would conclude the test
to be essentially multidimensional.

In the case of a relatively short essentially unidimensional test, however, examinees
within each subgroup are not likely to be approximately equal on the dominant trait
measured by the test, thereby causing the differences $\hat\sigma_k^2 - \hat\sigma_{U,k}^2$ to be large. This improperly
inflates the value of the statistic $T_L$ and results in statistical bias. This bias is amplified if
items of AT1 are homogeneous with respect to item difficulty, which often occurs when
AT1 is selected by factor analysis. To correct for this pre-asymptotic bias in $T_L$, AT2 is
constructed so that items of AT1 and AT2 are closely matched in their item difficulty
distribution. It has been observed that subtests AT1 and AT2 are both subject to similar
amounts of pre-asymptotic bias, but because AT2 is chosen to be similar to AT1 in
difficulty only, $T_B$ formed from AT2 will not be made larger by the presence of
multidimensionality. Thus, as statistical experimental design ideas suggest, the bias is
cancelled by forming the difference statistic T (Step 6, Stout, 1987).

Avoiding Bias due to Guessing and High Discrimination of Items 

Test items usually differ with respect to their various measurement properties.
There may be difficult items, easy items, high discrimination items, low discrimination
items, and so on. The 80-item SAT-Verbal vocabulary test analyzed by Lord (1968) is no
exception. Item parameter estimates for this test were obtained by LOGIST. DIMTEST
with a specified level of significance α=.05 was applied to a three-parameter logistic,
unidimensional, simulation model on various random subtests of 50 items selected from this
test. For 100 replications of DIMTEST on the simulated test data, 5 rejections of the
hypothesis of $d_E = 1$ were observed, strongly confirming the unidimensional nature of the
simulated data. Items of the SAT-Verbal test were then divided into two sets. One set
consisted of items having discrimination parameters greater than 1.0
(high-discriminations); the other set consisted of items with discrimination parameters less
than or equal to 1.0 (low-discriminations). DIMTEST was applied separately to each
subtest and the results were markedly different. Note from the classical test theory
perspective that the first test has high reliability and the second test low reliability.
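
A unidimensional three-parameter logistic simulation of the kind described above can be sketched as follows. Everything specific here is an assumption made for illustration: the function names, the scaling constant D = 1.7, the standard normal ability distribution, and the item parameters, which are invented and are not the LOGIST estimates for the SAT-Verbal items.

```python
import math
import random

def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    D = 1.7 is the conventional scaling constant (assumed here).
    """
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_3pl(n_examinees, items, seed=None):
    """Generate a 0/1 response matrix from one latent ability per examinee.

    items: list of (a, b, c) tuples. Abilities are drawn from N(0, 1),
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0, 1)
        data.append([int(rng.random() < p_3pl(theta, a, b, c))
                     for a, b, c in items])
    return data

# Hypothetical items: two highly discriminating, one weakly discriminating.
items = [(1.4, 0.5, 0.20), (1.2, 1.0, 0.25), (0.6, -0.5, 0.15)]
responses = simulate_3pl(750, items, seed=1)
```

Setting every c to zero in `items` gives the corresponding two-parameter logistic model without guessing.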



Table 1 



Table 1 displays the performance of the procedure, for both subtests, administered
to 750, 1000, and 2000 examinees. In these simulations, seven items were selected in each of
the assessment subtests based on factor analysis, with $J_{\min} = 20$. The reported values in
Table 1 are the number of rejections out of the 100 replications of DIMTEST.

The number of rejections for the test with low-discriminations is what is to be
expected on a unidimensional test. However, the rejection rate for the test with
high-discriminations far exceeds the nominal level of 5/100. Furthermore, as the number of
examinees increases, the rejection rate also increases.

This finding was confirmed in another unidimensional simulation, which used the
ASVAB general science test as its basis. Item parameter estimates for this test were
obtained by Mislevy and Bock (1984). In this simulation a rejection rate of 13/100 was
observed with α=.05. Further investigation showed that this elevated rejection rate was
caused by a preponderance of difficult, highly discriminating items. Thus, there is evidence
to show that if many items of a test are both highly discriminating and difficult with
guessing present, the observed type-I error rate may be unacceptably inflated.

In an attempt to determine the cause(s) for excess bias, Monte Carlo simulations
were investigated extensively with tests of high-discriminating items. Recalling that items
for AT1 were chosen according to the magnitude of their loadings on the second extracted
factor (Step 1, Stout, 1987), it was found that in the case of high-discriminations with
guessing present (with $d_E = 1$), the second factor was a very pronounced difficulty factor
even though tetrachoric correlations were used. One of the characteristics of the difficulty
factor is that very easy and very difficult items have high loadings of the opposite sign. In
the case of high-discriminations, for unknown reasons, but likely due to the presence of
guessing, most often very easy items tended to have larger factor loadings in magnitude on
the second factor than the corresponding collection of very difficult items. Consequently,
the easiest items tended to be selected for AT1. To control for statistical bias, DIMTEST
then selects the easiest remaining items for AT2. Therefore, PT is left with mostly difficult
items. Because examinees are grouped according to their scores on PT, which
consists mostly of difficult items in this case, the partitioning subtest (PT) tends to
misclassify low ability examinees. This misclassification is made worse if guessing is
allowed. Thus, examinee abilities within each assigned subgroup may vary considerably,
leading to a serious violation of the fundamental assumption of essential independence
within subgroups. This assumption is critical for the statistic T to adhere closely to the
nominal level of significance. As a result, the values of the statistic $T_L$ (computed from
AT1) averaged around 10, while the values of $T_B$ (computed from AT2) averaged around 7.
Thus, the values of $T = (T_L - T_B)/\sqrt{2}$ were so large that the hypothesis of $d_E = 1$ was
often rejected.³ Although $T_B$ is supposed to compensate for the bias in $T_L$, the bias in $T_L$
was so large that compensation was ineffective.
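
As a rough check on these figures, the reported averages can be plugged into the difference statistic. Treating the two averages as if they came from a single typical replication is a simplifying assumption made only for this illustration:

```latex
T \approx \frac{T_L - T_B}{\sqrt{2}} = \frac{10 - 7}{\sqrt{2}} \approx 2.12 > z_{0.05} \approx 1.645
```

A typical replication therefore falls in the rejection region of the one-sided .05 level test even though the simulated data are unidimensional.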

It is interesting to note that there are two reasons why the SAT subtest with
low-discriminations failed to exhibit statistical bias. First, low-discriminations enhance
the ability of AT2 to compensate (in a statistical, experimental design balancing sense⁴)
for the bias contributed by items of AT1. Second, the SAT subtest with
low-discriminations has a wider distribution of item difficulty, thereby tending to reduce
the misclassification of examinees in the formation of subgroups.

Another unidimensional simulation study was conducted with the same
high-discriminations SAT items, but with all c-parameters set to zero, creating a
high-discriminations 2PL model. There was 1 rejection out of 100 trials. Therefore, the
presence of guessing coupled with high-discriminations seemed to have caused the inflated
rejection rate. This is true because, without guessing in the model, a highly pronounced
difficulty factor is unlikely to appear in the tetrachoric factor analysis and, in fact, did not
appear in high-discriminations 2PL simulations. Moreover, eliminating guessing reduces
the problem of misclassification of low-ability examinees.

Based on the above findings, it was conjectured that when guessing/high
discrimination items are present, the assignment of examinees to subgroups could be done
more effectively using PT scores that were based on items that included easy items. This
was achieved in the following way. First, items of AT1 are checked statistically, using the
Wilcoxon rank sum test, to test if the items of AT1 are too easy as a group. If the
Wilcoxon rank sum test rejects, the procedure is to replace these items with items of highest
loadings of the opposite sign so that they are still dimensionally homogeneous⁵,⁶. If the
Wilcoxon rank sum test does not reject, items of AT1 are retained. Algorithm 1 in the
Appendix describes this procedure in detail. Items of AT2 are selected, as before, so that
items in AT1 and AT2 have approximately the same difficulty distribution.
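
The "too easy as a group" check can be sketched with a rank sum test on item easiness. This is an illustrative assumption-laden version, not Algorithm 1 itself: the function names are hypothetical, proportion correct is used as the easiness measure, the large-sample normal approximation is used without a tie correction, and the one-sided .05 level is an assumed choice.

```python
import math
from statistics import NormalDist

def rank_sum_z(sample, others):
    """Normal-approximation z for the Wilcoxon rank sum of `sample`.

    Ranks are taken in the pooled values; tied values share their
    average rank (no tie correction is applied to the variance).
    """
    pooled = sorted(sample + others)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2    # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(sample), len(others)
    w = sum(rank[v] for v in sample)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd

def at1_too_easy(p_at1, p_rest, alpha=0.05):
    """True if AT1 proportions correct rank significantly above the rest."""
    return rank_sum_z(p_at1, p_rest) > NormalDist().inv_cdf(1 - alpha)

# Hypothetical proportions correct: AT1 holds the four easiest of 16 items.
p_at1 = [0.95, 0.93, 0.90, 0.88]
p_rest = [0.30, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.82, 0.85]
print(at1_too_easy(p_at1, p_rest))
```

When the check returns True, the procedure described above would swap in the items with the highest loadings of the opposite sign.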

Automating the Size M of Assessment Subtests



As described previously, DIMTEST splits N items of the test into three subtests:
AT1 and AT2 of length M each, and PT of length n (= N − 2M). In all the simulation
studies presented in Stout (1987), the size of the assessment subtests M was specified by
the user a priori. For example, for a 30-item test, 5 or 7 items were used in each of the
assessment subtests; for a 50-item test, 8 or 12 items were used. By contrast, our aim has
been to develop an algorithm that automatically determines the size of assessment subtests
according to the magnitude of item loadings on the second extracted factor. For most
applications this would seem preferable to the selection of M, a priori, especially by a
novice user.

According to Stout's large sample theory for DIMTEST, M should be small
compared to N. Extensive Monte Carlo simulations showed that a minimum of four items
was needed in each of the assessment subtests in order to have reliable variance estimates
(Nandakumar, 1987; Stout, 1984, p. 31). To determine the maximum size of M (Max M)
that will yield desirable results, three different sizes of Max M were tried: Max M = 1/5 of the
test length, Max M = 1/4 of the test length, and Max M = 1/3 of the test length.
Similarly, to determine the minimum size of factor loading that should be used for
assigning an item to AT1, three different "starting" values (Start) of factor loadings were
tried: Start = .25, Start = .20, and Start = .15. An experimental design was set up for
conducting simulations with all three sizes of Max M and with all three values of Start. For
each combination of Max M and Start, both type-I error and power were observed over
repeated trials of DIMTEST with tests of different types. To illustrate, let Max M = 1/5
and Start = .25. Based on the loadings of the second factor, items with absolute loading
greater than .25 are to be considered for AT1 selection. The average item loading is
computed for items with positive loadings and for items with negative loadings. The set
with the highest average loading, in absolute value, is selected for AT1, and the size of this
set determines M. If the minimum required number of items is not obtained with either
positive or negative loadings, the start value is decreased by .05 until the minimum number
of items is found. Similarly, if in the selected set more than 1/5 of the items have absolute
loadings greater than .25, only the 1/5 of the items with the highest loadings are included.
Algorithm 2 in the Appendix describes this procedure in detail. The observation of type-I
error and power for different values of Max M and Start revealed that Max M = 1/4 and
Start = .15 yielded the most desirable results. These values were then used for selection of
items in the simulations reported in Tables 3 through 7 of this paper. Other combinations
of Max M and Start yielded either an observed type-I error rate that was too high or an
observed power level that was too low.
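The selection rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DIMTEST source: the function name `select_at1_items`, its parameter names, and the input list of second-factor loadings are hypothetical, the defaults follow the values reported as best (Max M = 1/4, Start = .15, minimum of four items), and breaking an exact tie in average loading by taking the larger set is an assumption.

```python
import math


def select_at1_items(loadings, max_fraction=0.25, start=0.15, m_low=4, step=0.05):
    """Return indices of the items chosen for AT1; M is the length of the result.

    loadings: second-factor loadings, one per item (hypothetical input).
    """
    n_items = len(loadings)
    m_high = max(m_low, math.floor(max_fraction * n_items))
    # Order each sign group from largest to smallest absolute loading.
    positive = sorted((i for i, v in enumerate(loadings) if v > 0),
                      key=lambda i: -loadings[i])
    negative = sorted((i for i, v in enumerate(loadings) if v < 0),
                      key=lambda i: loadings[i])

    def mean_abs(indices):
        return sum(abs(loadings[i]) for i in indices) / len(indices)

    cutoff = start
    while True:
        candidates = []
        for group in (positive, negative):
            # Keep items above the cutoff, capped at the m_high largest.
            kept = [i for i in group if abs(loadings[i]) > cutoff][:m_high]
            if len(kept) >= m_low:
                candidates.append(kept)
        if candidates:
            # Highest average absolute loading wins; larger set breaks ties (assumption).
            return max(candidates, key=lambda s: (mean_abs(s), len(s)))
        if cutoff <= 0:
            raise ValueError("fewer than m_low items load on either side")
        cutoff = round(cutoff - step, 10)  # lower the start value by .05
```

For a 20-item test, `len(select_at1_items(loadings))` gives M, which in this sketch always lies between 4 and 5.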



Standard Error Estimation in Stout's Statistic 



The general approach used in the development of Stout's statistic first derived an
asymptotically valid test statistic and then made adjustments to optimize the
pre-asymptotic behavior of the statistic, guided by Monte Carlo simulations.

Stout's statistic to test the hypothesis of essential unidimensionality was built by
combining information measuring the strength of evidence of nonunidimensionality
contributed by each of the k = 1, ..., K subgroups of examinees. That is, the goal was to
construct a statistic using the quantities $X_k$ of (5)
from the K subgroups of examinees. Each $X_k$ measures nonunidimensionality in the sense that
$X_k \approx 0$ when $d_E = 1$, and $X_k > 0$ on average when $d_E > 1$. The most obvious approach is to
add up the contributions of the $X_k$ and then normalize this sum by an appropriate standard
error of estimate. When unidimensionality holds, Stout (1987) found the estimated
asymptotic variance of $X_k$ to be $(S_k')^2$ of (6),



leading to the statistic

$$T' = \frac{T_L' - T_B'}{\sqrt{2}}$$

where

$$T_L' = \frac{\sum_{k=1}^{K} X_k}{\sqrt{\sum_{k=1}^{K} (S_k')^2}} \qquad (7)$$

Result 6.1 of Stout (1987, page 599) suggests⁷ that under regularity conditions when $d_E = 1$,
$T_L'$ and $T'$ should be asymptotically N(0,1) as the number of examinees and the number
of items both approach ∞. Moreover, Result 6.4 of Stout (1987, page 601) states that both
$T_L'$ and $T'$ should have asymptotic power one when $d_E > 1$.

Simulation studies conducted prior to the study reported in Stout (1987) showed
that, for test lengths and examinee population sizes typically encountered in practice, the
statistical test $T'$ falsely rejected the hypothesis of unidimensionality more frequently
than the nominal error rate.⁸ Two modifications for constructing T of (4) were then
considered: (a) enlarge $S_k'$ to $S_k$ of (3) so that the values of T on the average would be
smaller, thereby reducing the rate of occurrence of type-I error to close to or even below
the nominal level, and (b) normalize each $X_k$ by its estimated standard error and then sum
(instead of first summing and then normalizing as in (7)). This modified statistic T was
used in the simulation studies reported in Stout (1987). However, the observed average type-I
error (.023) in Stout (1987, Table 2) was well below the nominal level (α = .05).
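The difference between the two orders of normalization can be made concrete with a small Python sketch. The function names are hypothetical, the inputs are lists of subgroup quantities $X_k$ and their estimated standard errors, and the division by $\sqrt{K}$ in the second function is an assumption made here so that the two forms coincide when all standard errors are equal.

```python
import math


def sum_then_normalize(x, s):
    """Add the X_k first, then divide by the root of the summed variances."""
    return sum(x) / math.sqrt(sum(sk * sk for sk in s))


def normalize_then_sum(x, s):
    """Standardize each X_k by its own standard error, add, rescale by sqrt(K)."""
    k = len(x)  # number of subgroups
    return sum(xk / sk for xk, sk in zip(x, s)) / math.sqrt(k)
```

When the standard errors differ across subgroups the two functions give different values, because the second form keeps a subgroup with a large standard error from dominating the sum.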

Because $S_k'$ yielded too large an observed type-I error and $S_k$ yielded too small an
observed type-I error, the following adjustment to the estimated standard error was
considered, in addition to $S_k$, in the present study.



Furthermore, a basic question in constructing the statistic T was how to combine
the building blocks $X_k$ of (5) into a single appropriately normalized statistic for testing for
unidimensionality. That is, restricting attention to linear scoring, the search was for an
appropriate choice of weights $\{a_k, 1 \le k \le K\}$ to form $\sum_{k=1}^{K} a_k X_k$. Three different
weighting procedures were considered. Six new statistics, $T_1$ through $T_6$, as described
below, were derived as a result of using different weights and standard errors of estimate.
The objective was to find an improved statistic with an increased observed type-I error to
approximate the nominal level while maintaining or even improving the power.

An estimator or test statistic is useful provided it centers on the appropriate
parameter and has a small standard error. It can be shown that $\mathrm{Var}(\sum_{k=1}^{K} a_k X_k)$ is
minimized, subject to the constraint $\sum_{k=1}^{K} a_k = 1$, by setting $a_k =
[1/\mathrm{var}(X_k)] / \sum_{j=1}^{K} [1/\mathrm{var}(X_j)]$. Based on this argument, the statistic $T_1 =
(T_{L,1} - T_{B,1})/\sqrt{2}$ was constructed where

(8)

It can be seen that

(9)
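The minimum-variance claim can be checked with a short Lagrange-multiplier sketch. It assumes, as a simplification, that the $X_k$ are independent across subgroups with variances $v_k = \mathrm{var}(X_k)$; the symbols $v_k$ and $\lambda$ are introduced here only for illustration.

```latex
\begin{align*}
\mathrm{Var}\Big(\sum_{k=1}^{K} a_k X_k\Big) &= \sum_{k=1}^{K} a_k^2 v_k
  && \text{(independence assumed)} \\
\frac{\partial}{\partial a_k}\Big[\sum_{j=1}^{K} a_j^2 v_j
  - \lambda\Big(\sum_{j=1}^{K} a_j - 1\Big)\Big]
  &= 2 a_k v_k - \lambda = 0
  \;\Longrightarrow\; a_k = \frac{\lambda}{2 v_k} \\
\sum_{k=1}^{K} a_k = 1
  &\;\Longrightarrow\; a_k = \frac{1/v_k}{\sum_{j=1}^{K} 1/v_j} \\
\min \mathrm{Var}\Big(\sum_{k=1}^{K} a_k X_k\Big) &= \frac{1}{\sum_{j=1}^{K} 1/v_j}
\end{align*}
```

Dividing the weighted sum by the square root of this minimum variance gives $\sum_k (X_k / v_k) \big/ \sqrt{\sum_k 1/v_k}$, one natural normalized form once each $v_k$ is replaced by an estimate. This is offered as a sketch of the argument rather than as a reproduction of (8) and (9).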



The statistic $T_2$ was constructed similar to the statistic T of (4) but with the adjusted
standard error as the estimated standard error. That is, $T_{L,2}$ is given by

(10)

The statistic $T_3$ was constructed with weights as in $T_1$ but with a different
estimated standard error. That is, $T_{L,3}$ is given by

(11)

Based upon the naive, intuitive idea that those subgroups with more examinees in
them should receive more weight in the constructed statistic, two more definitions, $T_4$ and
$T_5$, were proposed, where

(12)

and

(13)

respectively.

Lastly, based upon the Central Limit Theorem and contrasted with the statistic T of
(3), the statistic $T_6$ was derived where

(14)



In summary, Stout's (1987) recommended statistic T and two of the new statistics use $S_k$
as the estimated standard error, and the other new statistics use the adjusted standard
error. The statistics $T_1$ and $T_3$ use weights according to the principle of minimum
variance and differ in their standard errors of estimate. The statistics $T_4$ and $T_5$ use
weights based on the number of examinees in each cell. Finally, the statistic $T_6$ is based
on the usual form of the Central Limit Theorem.⁹

We decided that the statistics $T_i = (T_{L,i} - T_{B,i})/\sqrt{2}$, i = 1, ..., 6, with different weights
and standard errors should provide an ample choice of statistics for a simulation study to
assess whether an improved statistic can be obtained that would be better than using T.



Monte Carlo Simulation Studies 



A Monte Carlo simulation study was undertaken to study the performance of
DIMTEST after performing corrections for high-discrimination bias using the Wilcoxon
rank sum test, automation of the size M of the assessment subtests, and correction of the
standard error of estimate. In all simulations, $J_{\min} = 20$ was adopted. The simulation
study was designed to be similar to Stout's (1987) study in order to compare the
performance of the statistic before and after the proposed corrections.

Two issues were of particular importance in the study: (a) how well the nominal
level of significance specified by the user (α = .05) is approximated by the observed level of
significance when $d_E = 1$,¹⁰ and (b) how large the power of the statistical test was in
various $d_E = 2$ settings.






The Preliminary Standard Error Study

In a preliminary pilot simulation study, the performance of the six different statistics
$T_1$ through $T_6$ was studied and compared with T (after implementing corrections for
high discrimination with guessing and automated M) with respect to type-I error and
power in various test settings. The results revealed that the statistic $T_2$ yielded a higher
observed type-I error, closer to the nominal level, and a higher power than T. The statistic
$T_6$ yielded an unacceptably large type-I error; statistics $T_1$, $T_3$, $T_4$, and $T_5$ differed little
in performance from T and thus would offer no advantage. Therefore, the statistic $T_2$ was
used in the simulations described below, and the results were compared with the simulations of
Stout (1987), obtained by using the statistic T prior to the proposed corrections. That is, T
used $S_k$ and did not correct for high-discriminating/guessing items, nor did it
automatically select M. By contrast, $T_2$ used the adjusted standard error, corrected for
high-discriminating/guessing items, and automatically selected M.

The Unidimensional Simulation Study 

The unidimensional three-parameter logistic model was used to simulate the test
data. In order for the simulated test data to reflect real data, item parameter estimates
were obtained from real data sets for five different tests: SATV, ACTM, ACTE, ASVAB
AS, and ASVAB AR.¹¹ The distributions of item parameters for these five tests are given in
Table 2 and show that the five tests differ not only in length but also in the distribution of
difficulty and discrimination parameters. For example, ACTE had the lowest mean and
standard deviation of item discrimination parameters; ASVAB AR had the highest mean
item discrimination; and ASVAB AS had the highest standard deviation of item discrimination.
For each test type, two examinee sample sizes J were studied: 750 and 2000. With the
sample size of 750, 250 examinees were used for factor analytic selection of assessment
items, while the remainder were used to compute the test statistic. With J = 2000, 500
examinees were used for the factor analysis and the remainder were used for computing the
statistic.



Table 2 



Binary item responses were generated as explained below. Examinee abilities were
randomly generated from the standard normal distribution. For each simulated examinee,
the probability, $P_i(\theta)$, of correctly answering each item was computed using the
three-parameter unidimensional logistic model. If a uniform random deviate in the interval
(0,1) was less than or equal to the computed probability $P_i(\theta)$, the examinee was
considered to have answered the item correctly and was given a score of 1; otherwise, the
examinee was given a score of 0.
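The response-generation step can be sketched as follows. This is a minimal illustration, not the simulation code used in the study: the function names are hypothetical, item parameters are passed as (a, b, c) triples, and the scaling constant 1.7 in the logistic function is an assumption, since the text does not state which scaling was used.

```python
import math
import random


def p_correct_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def simulate_responses(items, n_examinees, seed=0):
    """Return an n_examinees x n_items list of 0/1 scores.

    items: list of (a, b, c) triples (discrimination, difficulty, guessing).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # ability from the standard normal
        # Score 1 when a uniform deviate does not exceed P_i(theta), else 0.
        row = [1 if rng.random() <= p_correct_3pl(theta, a, b, c) else 0
               for (a, b, c) in items]
        data.append(row)
    return data
```

Fixing the seed makes a replication reproducible; a new seed per replication corresponds to simulating new examinee responses each time.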

For each combination of test type and examinee sample size, DIMTEST (as here modified
by the Wilcoxon rank sum test, automated M, and the alternate standard error of estimate)
was replicated 100 times, with new examinee responses being simulated each time.
The number of rejections out of 100 replications of testing the null hypothesis of essential
unidimensionality is reported in Table 3. Because the test data are generated from a
unidimensional model, the observed level of significance should be close to the nominal
level, which was set to .05.






Table 3 and Table 4 



Table 3 shows the observed type-I error for all five simulated test types for
different sample sizes. Of particular interest is the second column: rejection rates for the
SATV with high discriminations. Contrasting these results with the rejection rates of Table 1
shows that, with the proposed correction for excess bias (that is, the Wilcoxon rank sum
test), the rejection rates have dropped to an acceptable level. For example, the rejection
rate with 2000 examinees has dropped from 58 to 7. For other test types, the observed level
of significance is also close to the nominal level. Table 4 compares these results with those
of Stout (1987, Table 2), where the statistic T was used. The contents of Table 4 show that,
as a consequence of the proposed refinements, the observed type-I error rate has increased
or remained the same for all test types and sample sizes except for ASVAB AR with 2000
examinees. The overall average observed type-I error has increased from .023 (Stout, 1987)
to .045 and is very close to the nominal value of .05. In addition, for each one of the cell
entries, there is no statistical evidence to reject the hypothesis that the nominal level of
significance of .05 holds. That is, they are all consistent with a rejection probability of .05.¹²
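The consistency check on the cell entries can be reproduced with a short calculation. The sketch below uses the normal approximation to the binomial count of rejections; the function name is hypothetical, and the two-sided 5% level for this check is an assumption.

```python
import math
from statistics import NormalDist


def acceptance_region(n_trials=100, p0=0.05, level=0.05):
    """Standard error and two-sided acceptance region for a rejection count."""
    se = math.sqrt(n_trials * p0 * (1.0 - p0))  # binomial standard deviation
    z = NormalDist().inv_cdf(1.0 - level / 2.0)  # two-sided critical value
    center = n_trials * p0
    return se, (center - z * se, center + z * se)
```

With 100 replications and p = .05 this gives a standard error of about 2.2 rejections and a region of roughly .7 to 9.3 rejections.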

The Two-Dimensional Simulation Study

The two-dimensional simulation study was modeled according to the
multidimensional three-parameter logistic model with compensatory abilities (Reckase &
McKinley, 1983) given by:

(15)

Seven different test types were considered to study the power of the procedure after
the proposed changes. Two-dimensional counterparts of the five test types used in the
unidimensional simulation study were simulated in the following manner. The
discrimination parameters $(a_{1i}, a_{2i})$ of the two dimensions for each item were
independently generated from a normal distribution with mean $\mu/2$ and standard
deviation $\sigma/\sqrt{2}$,
where $\mu$ and $\sigma$ were the mean and the standard deviation of the distribution of
discrimination parameters of the respective unidimensional test taken from Table 2.
Likewise, $b_{1i}$ and $b_{2i}$ were assumed to be independent of each other for each item and were
generated from a normal distribution with mean $\mu$ and standard deviation $\sigma$,
where $\mu$ and $\sigma$ were the mean and the standard deviation of the distribution of difficulty
parameters of the respective unidimensional test taken from Table 2. For example, to
generate the two-dimensional counterpart of the SATV test, the $a_{1i}$'s and $a_{2i}$'s were
generated independently from the normal distribution with mean 1.07/2 and standard
deviation $.4/\sqrt{2}$. Similarly, the $b_{1i}$'s and $b_{2i}$'s were generated independently from the
normal distribution with mean .58 and standard deviation .88. Each test was taken to
consist of "pure" items dependent on $\theta_1$ alone, "pure" items dependent on $\theta_2$ alone,
and mixed items dependent on both $\theta_1$ and $\theta_2$.



Abilities $\theta = (\theta_1, \theta_2)$ were generated from a bivariate normal distribution with
both means being zero and both variances being one. The correlation coefficient $\rho$ between
the abilities varied appropriately. The c-parameter was taken to be .20 for all items.
Binary item responses were generated exactly as described for unidimensional tests using
(15).
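The two-dimensional generation scheme can be sketched in Python as below. Several points are assumptions and are marked as such in the code: the response function is written in a common compensatory form with separate difficulties $b_1$ and $b_2$ and a scaling constant of 1.7, which may differ in detail from (15); all function names are hypothetical; and a "pure" item would be obtained by setting one of the two discriminations to zero.

```python
import math
import random


def p_correct_2d(theta1, theta2, a1, a2, b1, b2, c=0.20, scale=1.7):
    """Assumed compensatory two-dimensional 3PL response probability."""
    z = scale * (a1 * (theta1 - b1) + a2 * (theta2 - b2))
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def draw_abilities(rho, rng):
    """Bivariate normal abilities: zero means, unit variances, correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2


def draw_mixed_item(mu_a, sd_a, mu_b, sd_b, rng):
    """Mixed item: a ~ N(mu_a/2, sd_a/sqrt(2)) and b ~ N(mu_b, sd_b) per dimension."""
    a1 = rng.gauss(mu_a / 2.0, sd_a / math.sqrt(2.0))
    a2 = rng.gauss(mu_a / 2.0, sd_a / math.sqrt(2.0))
    b1 = rng.gauss(mu_b, sd_b)
    b2 = rng.gauss(mu_b, sd_b)
    return a1, a2, b1, b2


def simulate(n_examinees, items, rho, seed=0):
    """Return 0/1 scores; items is a list of (a1, a2, b1, b2) tuples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        t1, t2 = draw_abilities(rho, rng)
        data.append([1 if rng.random() <= p_correct_2d(t1, t2, *item) else 0
                     for item in items])
    return data
```

For the SATV counterpart described above, items would be drawn with `draw_mixed_item(1.07, 0.4, 0.58, 0.88, rng)`.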

In addition to the five two-dimensional counterparts of unidimensional tests, two
more tests, the ACT Mathematics Usage Form 8B (ACTM8B) and the ACT Mathematics
Usage Form 24B (ACTM24B), were used. For these two tests, estimated two-dimensional
item parameters $(a_{1i}, a_{2i})$ and $(b_{1i}, b_{2i})$ were obtained from the American College Testing
Program. Except for item parameter generation, which was replaced by the use of actual
item parameter estimates, the responses for these two tests were simulated as described
above.

For each of the seven test types, two examinee sample sizes J were considered (750
and 2000) and two levels of correlation $\rho$ were considered (.5 and .7). As in the
unidimensional study, when J = 750, 250 examinees were used for factor analysis, and, when
J = 2000, 500 examinees were used for factor analysis. For each combination of test type,
examinee sample size, and level of correlation, DIMTEST (as modified by the Wilcoxon
rank sum test, automated M, and the alternate standard error of estimate) was
replicated 100 times, each time simulating new examinees. For the first five test types, a
new set of item parameters was generated for each test after each 10 replications. The
number of rejections over 100 replications is reported in Table 5 for each case.



Table 5 and Table 6 




In the case of $d_E = 2$, one wants good power; that is, one wants a high probability of rejection
for a broad range of realistic $d_E = 2$ alternatives. The contents of Table 5 show that the
power is extremely high for the case of $\rho$ = .5 for both sample sizes. The power is very high
for $\rho$ = .7 with 2000 examinees, and the power is good for $\rho$ = .7 with 750 examinees. These
results are noteworthy, considering that all tests in the simulation study consist of at least
one-third mixed items requiring knowledge of both traits to be answered correctly.
Furthermore, it can be seen that as the sample size increases, the power also increases.

Table 6 compares the results of the present study with the results of Stout's
simulation study, which used the statistic T (1987, Table 6). It can be seen, as a
consequence of the proposed refinements, that the power has increased for every test type,
sample size, and level of correlation. On the average, power has gone up from 67 to 88
rejections per 100 trials of the procedure for the case of $\rho$ = .5 with 750 examinees, from 92
to 99 rejections for the case of $\rho$ = .5 with 2000 examinees, from 36 to 54 for the case of
$\rho$ = .7 with 750 examinees, and from 67 to 90 rejections for the case of $\rho$ = .7 with 2000
examinees. These average increases are large enough to be of practical importance.

Real Data Study 

Four different data sets were used to examine the performance of DIMTEST on
actual data. Data for two Armed Services Vocational Aptitude Batteries, used by the
Department of Defense Student Testing Program in high schools and post-secondary
schools, were obtained from Linn, Hastings, Hu, and Ryan (1987). These tests included
Arithmetic Reasoning tests for Grades 10 and 12 (AR10 & AR12), each with 30 items and
1984 and 1961 examinees, respectively. Two more data sets were obtained from the American
College Testing (ACT) Program. These included ACT mathematics usage Forms B and C
(F29B & F29C), each with 40 items and 2491 and 2494 examinees, respectively.






DIMTEST was applied to each of the four data sets. In each data set, 500 examinees
were randomly selected for factor analysis; the rest were used for computing the statistic.
Examinees were randomly split into two groups, one group for performing factor analysis
and the other for computing the statistic, 100 times, each time testing the null
hypothesis of essential unidimensionality. The number of rejections over the 100 replications of
the procedure was noted. The results for all tests are tabulated in Table 7.
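The repeated random splitting can be sketched as below. The generator name and its parameters are hypothetical, examinees are represented only by their row indices, and the step that would run DIMTEST on each split is left out because that program is not reproduced here.

```python
import random


def split_replications(n_examinees, n_factor=500, n_reps=100, seed=0):
    """Yield (factor_analysis_ids, statistic_ids) for each random split."""
    rng = random.Random(seed)
    ids = list(range(n_examinees))
    for _ in range(n_reps):
        shuffled = rng.sample(ids, k=n_examinees)  # random permutation
        yield shuffled[:n_factor], shuffled[n_factor:]
```

For AR10, `split_replications(1984)` would give 100 splits of 500 and 1484 examinees; counting how many splits lead to rejection gives an entry of the kind reported in Table 7.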



Table 7 



The contents of Table 7 suggest that, according to DIMTEST, AR10 and AR12
should be assessed as essentially unidimensional tests while F29B and F29C should be
assessed as multidimensional tests. Examination of the items of F29B and F29C showed that
these tests consist of items assessing knowledge of arithmetic and algebra operations,
geometry, numeration, story problems, and advanced topics. Therefore, from the
perspective of content, F29B and F29C would seem to be multidimensional tests measuring
highly correlated abilities. The rejection rate for AR12 is slightly higher than expected for
an essentially unidimensional test. One or two items highly influenced by another factor
may contribute to this high rejection rate, or many items may be slightly influenced by a
second factor. Further investigation is necessary to examine possible reasons.

Summary and Discussion 

Detailed investigation of DIMTEST for assessing unidimensionality revealed certain
limitations. It failed to perform desirably when the test consisted of predominantly
difficult, high-discrimination items coupled with guessing. This limitation was
overcome by a more appropriate selection of assessment items. Also, an automated
approach was devised to determine the size of assessment subtests, and the estimate of the
standard error of the statistic was adjusted to yield the desired level of significance for
$d_E = 1$ data and higher power for $d_E > 1$ data. After the proposed refinements were
implemented, DIMTEST was applied to a variety of simulated tests for different sample
sizes; these tests were modeled on Stout's (1987) simulation study.

Comparison of the results of the present study with the results of Stout's (1987)
study indicates that the proposed refinements have improved the observed level of
significance, which is now close to the nominal level for $d_E = 1$ simulations, and have considerably
increased the power for $d_E = 2$ simulations for different levels of correlation and sample
sizes. In addition, the procedure has been used on a number of real data sets. The results of
the real test data study seem to confirm the a priori hypotheses regarding the
dimensionality of these tests.¹³

The refinements have led to a revised test procedure that is, in particular, more
robust against unusually high discrimination parameters with guessing present and that, in
general, is able to perform more desirably with respect to type-I and type-II errors.
Moreover, the procedure is automated and totally data-dependent in its selection of
assessment subtest items, making it more user friendly. The automation of the size of the
assessment subtests could especially benefit the novice user. Because the power of the
statistical test relies heavily upon appropriate selection of items for AT1, our simulation
study provides further evidence that the use of linear factor analysis for selection of these
items is a promising approach that requires little effort on the part of the user.

When the statistical test rejects the null hypothesis of essential unidimensionality,
it is possible to proceed in several ways. One approach would be to reexamine the test and
assess the complexity of the essential multidimensionality present using DIMTEST,
NOFA, and so forth. If inference suggests that each of the different dominant traits
influences a distinct group of items (i.e., there is a pronounced simple structure), the test
could be split into several essentially unidimensional subtests, and each one could be
analyzed separately using unidimensional IRT models. Alternately, if most of the items of
the test are each influenced simultaneously by several dominant dimensions, then the
researcher may need to resort to multidimensional parametric models in order to make
inferences about the test data (Reckase, 1985, 1989).

The dimensionality of a set of item responses is conceptually very complex. It is a
function of items, examinees, and extraneous factors such as type of instruction and stage
of learning. Also, dimensionality is, from the practical perspective, a continuum. Because
items are multiply determined, among finite length tests (the only kind available in
applications), there is no such thing as a strictly unidimensional test. But we can still
describe a given set of item responses as being well modeled by an essentially
unidimensional test model. Junker (1990, 1991) argues that an index for the continuum of
dimensionality should be developed with strict unidimensionality, in the sense of fitting
local independence models, on one end and strict essential multidimensionality on the other
end, with essential unidimensionality in between. Junker and Stout (1991) have developed
indices for lack of essential unidimensionality, which can be extremely useful for assessing
the degree of lack of essential unidimensionality when Stout's test of $d_E = 1$ is rejected.
Additionally, these indices show when it is safe to use unidimensional estimation
procedures such as LOGIST or BILOG to arrive at accurate ability estimates. The
conjecture is that lack of strict unidimensionality is not detrimental, provided $d_E = 1$
modeling provides a good approximation to reality. The number of items influenced by the
secondary dimensions, as well as the strength of the influence of secondary dimensions on
each item, should determine how strong the lack of $d_E = 1$ is. Nandakumar (1991) has
demonstrated the utility of DIMTEST in assessing essential unidimensionality when test
items were influenced by various dimensions to various degrees, and thus strict
dimensionality exceeded one. Nandakumar has found that the accuracy of the
approximation of essential unidimensionality for a test is a function of the proportion of
test items influenced by the various nondominant traits present and by the strength of the
influence of these traits.

Stout's procedure seems very promising for assessing the dimensionality underlying
a set of items. It is an outgrowth of the conceptual definition of essential unidimensionality
and was developed to be sensitive to dominant dimensions and insensitive to transient or
minor dimensions. The procedure is nonparametric (thus avoiding parametric model-data
fit problems), is supported by an asymptotic theory, and is computationally simple.
However, the procedure is relatively new, and its applicability in a variety of realistic
applications needs to be studied further. Software to run DIMTEST is available from the
authors.






Appendix 

Algorithm 1: Test for Difficulty Factor

1. Rank the N items from most difficult (rank 1) to easiest (rank N).

2. Compute the sum $W_s$ of the ranks of the M items in AT1.

3. Compute the mean $E(W_s)$ and the standard deviation $SD(W_s)$ of the
sum $W_s$ under the assumption of randomly distributed ranks:

$E(W_s) = \frac{1}{2} M (N + 1)$

$SD(W_s) = \left(\frac{1}{12} M (N - M)(N + 1)\right)^{1/2}$

4. Compute the critical value C for $W_s$ under the usual large sample
approximation:

$C = E(W_s) + z_\alpha \, SD(W_s)$

where $z_\alpha$ is the upper 100(1 − α)th percentile of the standard normal
distribution and α is the desired level of significance.

5. If $W_s > C$, conclude that the M items in AT1 are too easy.
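The five steps above can be sketched in Python as follows. The function name and argument names are hypothetical, the default α = .05 is an assumption (the level used for this check is not stated here), and no continuity correction is applied.

```python
import math
from statistics import NormalDist


def at1_too_easy(at1_ranks, n_items, alpha=0.05):
    """Large-sample rank sum check of whether the AT1 items are too easy.

    at1_ranks: difficulty ranks of the AT1 items (1 = most difficult).
    Returns (rank sum, critical value, decision).
    """
    m = len(at1_ranks)
    w = sum(at1_ranks)
    mean = m * (n_items + 1) / 2.0
    sd = math.sqrt(m * (n_items - m) * (n_items + 1) / 12.0)
    critical = mean + NormalDist().inv_cdf(1.0 - alpha) * sd
    return w, critical, w > critical
```

A return value of `True` in the last position corresponds to step 5: the rank sum exceeds the critical value, so the AT1 items are judged too easy.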






Algorithm 2: The Size M of Assessment Subtests

Let N = total number of items, Mlow = 4, Mhigh = [N/4], and
Maxload = .15.

1. Compute

a) $I_1$ = number of positive loadings > Maxload,

b) $I_2$ = number of negative loadings < −Maxload.

2. Redefine

$I_1$ := min(Mhigh, $I_1$)
$I_2$ := min(Mhigh, $I_2$).

3. If both $I_1$ < Mlow and $I_2$ < Mlow, then define

Maxload := Maxload − .05.
Go to Step 1.

4. If either $I_1$ or $I_2$ is ≥ Mlow, then let

M = $I_1$ if $I_1$ ≥ Mlow
M = $I_2$ if $I_2$ ≥ Mlow

5. If both $I_1$ ≥ Mlow and $I_2$ ≥ Mlow, then compute the averages Avg1
and Avg2 of item loadings for sets corresponding to $I_1$ and $I_2$,
respectively. Let

M = $I_1$ if Avg1 > Avg2
M = $I_2$ if Avg2 > Avg1
M = Max($I_1$, $I_2$) if Avg1 = Avg2






Notes 

¹Throughout, we speak of a unidimensional test, a unidimensional set of items, etc. This
convenient phrasing represents the more complex reality that the dimensionality of a model
or a data set rests on the joint influence of test items and examinee population. Items and
examinees together produce test data that we judge by statistical inference to be
unidimensional or not. Reckase (1990) writes perceptively on this point. Technically, IRT
dimensionality is usually defined to be the lowest latent space dimension possible, such
that monotonicity and local independence hold.

²Note that the statistic $T_L$ computed from AT1 is sensitive to dimensionality (that is, it
can discriminate between $d_E = 1$ vs $d_E > 1$) and to sources of bias. The idea in introducing
AT2 is to deliberately make $T_B$ sensitive only to sources of bias but not to dimensionality.

³In unidimensional settings where the procedure worked well, typical values of $T_L$ ranged
roughly from 1 to 5, and typical values of $T_B$ ranged roughly from .6 to 4.0; thus
typical values of T ranged roughly from −1.0 to 1.5.

⁴If a randomized block design with M blocks of size 2 is to be used in an experiment with
human subjects assigned to control and treatment groups, this experimental design
technique will work well unless the subjects are too variable. By rough analogy, the higher
the discrimination parameters, the more "variable" are the items that are being assigned to
AT1 and AT2 and the less effective the difficulty matching method (analogous to blocking)
of AT2 item selection is in eliminating bias.

⁵It can be observed that when items in AT1 are replaced (because they are too easy) with
items of high loadings of the opposite sign, easy PT items could result, thereby causing
inaccuracy in subgroup assignment of high-scoring examinees. Simulation results have
shown that this potential inaccuracy is not as detrimental to the value of the statistic T as
it was when PT had mostly difficult items.

⁶We also tried to correct tetrachoric correlations for guessing by following Bock, Gibbons,
and Muraki (1985) and by using nonlinear factor analyses to diminish the influence of
difficulty on the second factor loadings. Regarding correction of tetrachorics, we found that
when guessing values were about .2 in the model, a large percentage of the sample
correlations was computed as 1 or −1. However, when the guessing levels were arbitrarily
cut by half, the problem of extreme correlations was reduced. Even with this reduction of
guessing levels, the items selected for AT1 did not differ significantly from those selected
without correction for guessing. Moreover, the ad hoc method of cutting guessing levels
defeats the purpose of using the three-parameter logistic model. Therefore correction for
guessing was not implemented.

The nonlinear factor extraction program NOFA was used to select items for the
assessment subtests. We tried a two-factor quadratic model for this purpose. In comparing
the results of linear and nonlinear factor analysis, we found no difference in T-values
between the two methods. To our surprise the difficulty factor reappeared even with
nonlinear factor analysis. Therefore, we did not implement nonlinear factor analysis.

⁷The reason for the word "suggests" instead of "establishes" is that Stout's result actually
assumes unidimensionality under the stronger assumption of local independence. Further,
the asymptotic variance in (6) also assumes the stronger assumption of local
independence.






⁸$(S_k')^2$ is an asymptotic variance and fails to account for the overdispersion that
occurs as examinees in a fixed PT subgroup have varying abilities, even though the test is
unidimensional. Thus, $S_k'$ will underestimate the true standard error and will yield too
large a type-I error (see Cox & Snell, 1989, pp. 106-110, for a nice discussion of
overdispersion resulting from a varying parameter such as ability).

⁹There are, of course, many more possibilities for computing statistics with given weights
and standard errors of estimate, but those described here were considered the most
appropriate.

¹⁰Technically, our simulations were done with d = 1, implying $d_E = 1$. For simulation studies
for which $d_E = 1$ but d > 1, see Nandakumar (1991).

¹¹The SATV denotes the SAT-verbal test obtained from Lord (1968); ACTM denotes the
ACT mathematics usage test, and ACTE denotes the ACT English Usage test, both
obtained from Drasgow (1987); ASVAB AS and ASVAB AR denote the Armed Services
Vocational Aptitude test Battery, Auto Shop Information and Arithmetic Reasoning
respectively, both obtained from Mislevy and Bock (1984).

¹²The standard error for testing the hypothesis of p = .05 vs p ≠ .05 is approximately 2.2
trials. Thus, the acceptance region of this test for a set of 100 simulations is given by (.7,
9.3) trials.

¹³We say "seem to" because one cannot really know that a real data set is $d_E = 1$ or $d_E > 1$.
Further, the 100 replications of Table 7 are not the result of 100 administrations of the test
to similar examinee populations, but rather 100 variations of the application of the statistic
to one data set that resulted from one administration of the test.






REFERENCES 



Bejar, I. I. (1980). A procedure for investigating the unidimensionality of achievement tests
based on item parameter estimates. Journal of Educational Measurement, 17,
283-296.

Bejar, I. I. (1983). Introduction to item response theory models and their assumptions.
In R. K. Hambleton (Ed.), Applications of item response theory (pp. 1-23). British
Columbia: Educational Research Institute of British Columbia.

Berger, M. P., & Knol, D. (1990, April). On the assessment of dimensionality in
multidimensional item response theory models. Paper presented at the annual
meeting of the American Educational Research Association, Boston.

Birenbaum, M., & Tatsuoka, K. K. (1982). On the dimensionality of achievement data.
Journal of Educational Measurement, 19, 259-266.

Bock, R. D., Gibbons, R., & Muraki, E. (1985). Full-information item factor analysis
(MRC Report No. 85-1). Chicago: National Opinion Research Center.

Carroll, J. B. (1945). The effect of difficulty and chance success on correlations between
items and between tests. Psychometrika, 26, 347-372.

Christoffersson, A. (1975). Factor analysis of dichotomized variables. Psychometrika, 40,
5-32.

Cox, D. R., & Snell, E. J. (1989). Analysis of Binary Data. London: Chapman-Hall.

Drasgow, F. (1987). A study of measurement bias of two standard psychological tests.
Journal of Applied Psychology, 72, 19-30.

Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test
items. Applied Psychological Measurement, 10, 287-302.

Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and
applications. Kluwer-Nijhoff Publishers, Boston.

Hattie, J. (1984). An empirical study of various indices for determining unidimensionality.
Multivariate Behavioral Research, 19, 49-78.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items.
Applied Psychological Measurement, 9, 139-164.

Holland, P. W. (1981). When are item response models consistent with observed data?
Psychometrika, 46, 79-92.

Holland, P. W., & Rosenbaum, P. R. (1986). Conditional association and
unidimensionality in monotone latent variable models. Annals of Statistics, 14,
1523-1543.

Hulin, C. L., Drasgow, F., & Parsons, L. K. (1983). Item Response Theory.
Homewood, Illinois: Dow Jones-Irwin.

Humphreys, L. G. (1981). The primary mental ability. In M. P. Friedman, J. P. Das, &
N. O'Connor (Eds.), Intelligence and learning (pp. 87-102). New York: Plenum.

Humphreys, L. G. (1985). General intelligence: An integration of factor, test, and
simplex theory. In B. B. Wolman (Ed.), Handbook of intelligence (pp. 201-224).
New York: Wiley.

Humphreys, L. G. (1986). An analysis and evaluation of test and item bias in the
prediction context. Journal of Applied Psychology, 71, 327-333.

Etazadi-Amoli, J. E., & McDonald, R. P. (1983). A second generation nonlinear factor
analysis. Psychometrika, 48, 315-342.

Junker, B. (1988). Statistical aspects of a new latent trait theory. Unpublished doctoral
dissertation, University of Illinois at Urbana-Champaign.

Junker, B. (1990, June). Essential independence and structural robustness in item response
theory. Paper presented at the annual meeting of the Psychometric Society,
Princeton.

Junker, B. (1991). Essential independence and likelihood-based ability estimation for
polytomous items. Psychometrika, 56, 255-278.

Junker, B., & Stout, W. (1991, July). Structural robustness of ability estimates in item
response theory. Paper presented at the 7th European Meeting of the Psychometric
Society, Trier, Germany.

Linn, R. L., Hastings, N. C., Hu, G., & Ryan, K. E. (1987). Armed Services Vocational
Aptitude Battery: Differential item functioning on the high school form. Dayton,
OH: USAF Human Resources Laboratory.

Lord, F. M. (1968). An analysis of verbal scholastic aptitude test using Birnbaum's
three-parameter logistic model. Educational and Psychological Measurement, 28,
989-1020.

McDonald, R. P. (1962). A general approach to nonlinear factor analysis. Psychometrika,
4, 397-415.

McDonald, R. P., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British
Journal of Mathematical and Statistical Psychology, 27, 82-89.

Mislevy, R. J. (1986). Recent developments in the factor analysis of categorical variables.
Journal of Educational Statistics, 11, 3-31.

Mislevy, R. J., & Bock, R. D. (1984). Item operating characteristics of the Armed Services
Aptitude Battery (ASVAB) (Tech. Rep. No. N00014-83-C-0283). Chicago, IL:
Office of Naval Research.

Muthen, B. (1978). Contributions to factor analysis of dichotomous variables.
Psychometrika, 43, 551-560.






Nandakumar, R. (1987). Refinements of Stout's procedure for assessing latent trait
dimensionality. Unpublished doctoral dissertation, University of Illinois,
Urbana-Champaign.

Nandakumar, R. (1991). Traditional dimensionality vs. essential dimensionality.
Journal of Educational Measurement, 28, 99-117.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests:
Results and implications. Journal of Educational Statistics, 4, 207-230.

Reckase, M. D. (1985). The difficulty of test items that measure more than one
ability. Applied Psychological Measurement, 9, 401-412.

Reckase, M. D. (1989). The interpretation and application of multidimensional item
response theory models; and computerized testing in the instructional environment.
Iowa City, IA: American College Testing Program.

Reckase, M. D. (1990, April). Unidimensionality data from multidimensional tests and
multidimensional data from unidimensional tests. Paper presented at the annual
meeting of the American Educational Research Association, Boston.

Reckase, M. D., & McKinley, R. L. (1983, April). The definition of difficulty and
discrimination for multidimensional item response theory models. Paper presented
at the annual meeting of the American Educational Research Association, Montreal.

Roznowski, M. A., Tucker, L. R., & Humphreys, L. G. (1991). Three approaches to
determining the dimensionality of binary data. Applied Psychological Measurement,
15, 109-128.

Stout, W. F. (1984). The statistical assessment of latent trait dimensionality in
psychological testing. Rep. No. N00014-82-K-0486. Urbana, IL: Office of Naval
Research Technical Report.

Stout, W. F. (1987). A nonparametric approach for assessing latent trait dimensionality.
Psychometrika, 52, 589-617.

Stout, W. F. (1990). A new item response theory modeling approach with applications
to unidimensional assessment and ability estimation. Psychometrika, 55, 293-326.

Traub, R. E. (1983). A priori considerations in choosing an item response model. In R.
K. Hambleton (Ed.), Applications of item response theory. British Columbia:
Educational Research Institute of British Columbia.

Yen, W. M. (1985). Increasing item complexity: A possible cause of scale shrinkage
for unidimensional item response theory. Psychometrika, 50, 399-410.

Zwick, R. (1987a). Assessment of dimensionality of NAEP Year 15 reading data (ETS
Research Report 86-4). Princeton, N.J.: Educational Testing Service.

Zwick, R. (1987b). Assessing the dimensionality of NAEP reading data. Journal of
Educational Measurement, 24, 293-308.






Table 1

Rejection rates per 100 trials for d_E = 1 simulation study using
estimated item parameters of SAT verbal test with α = 0.05

Discrimination             Number of     Number of examinees
parameter                  items         750     1000    2000

0 < a_i < 1.0              41            4       0       3
(low discriminations)

1.1 < a_i < 2.0            39            28      46      58
(high discriminations)
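
The rates in Table 1 can be read against the acceptance region of (.7, 9.3) rejections per 100 trials quoted in the notes. The short sketch below does this comparison; applying that region to Table 1 is an assumption made here for illustration, and the labels and function name are hypothetical.

```python
# Acceptance region (rejections per 100 trials) quoted in the notes for alpha = .05.
LOWER, UPPER = 0.7, 9.3

# Rejection counts per 100 trials transcribed from Table 1,
# keyed by item group and number of examinees.
table1 = {
    "low discrimination": {750: 4, 1000: 0, 2000: 3},
    "high discrimination": {750: 28, 1000: 46, 2000: 58},
}


def classify(rejections, lower=LOWER, upper=UPPER):
    """Report where an observed rejection count falls relative to the region."""
    if rejections < lower:
        return "below"
    if rejections > upper:
        return "above"
    return "within"


results = {
    group: {n: classify(count) for n, count in rates.items()}
    for group, rates in table1.items()
}

for group, by_n in results.items():
    for n, verdict in by_n.items():
        print(f"{group:20s} J={n:<5d} {table1[group][n]:>3d}  {verdict}")
```

Under this reading, every high-discrimination cell lies above the region, while the low-discrimination cells are within it or, for 1000 examinees, below it.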












Table 2

Sample distributions of item parameters for the five
standardized tests used in the study

                                              ASVAB     ASVAB
                SATV      ACTM      ACTE      AS        AR

N               [illegible in the source scan; only "30" is readable]

Max a_i's       2.00      2.00      1.58      2.82      2.76
Min a_i's       0.40      0.40      0.11      0.32      0.50
Mean a_i's      1.07      1.0?      0.72      1.22      1.46
S.D. a_i's      0.40      0.35      0.25      0.70      0.51

Max b_i's       2.50      1.50      2.07      1.27      1.01
Min b_i's      -1.50     -1.02     -3.11     -1.39     -2.72
Mean b_i's      0.58      0.50      0.03      0.09     -0.02
S.D. b_i's      0.88      0.61      0.96      0.72      0.84

Max c_i's       0.20      0.21      0.27      0.26      0.34
Min c_i's       0.04      0.02      0.04      0.06      0.08
Mean c_i's      0.16      0.14      0.15      0.20      0.19
S.D. c_i's      0.05      0.04      0.03      0.04      0.06


N denotes the test length. 

SATV denotes the SAT verbal test battery. 
ACTM denotes the ACT mathematics usage test battery. 
ACTE denotes the ACT English usage test battery. 
ASVAB AS denotes the Armed Services Vocational 
Aptitude Battery for auto shop information. 
ASVAB AR denotes the Armed Services Vocational 
Aptitude Battery for arithmetic reasoning. 
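
For readers who want to see how parameters of the kind summarized in Table 2 enter a response model, the sketch below evaluates a three-parameter logistic item response function. It assumes that a, b, and c are the usual discrimination, difficulty, and guessing parameters of the three-parameter logistic model named in the title of Lord (1968). The scaling constant D = 1.7 is a common convention and is an assumption here, and the parameter values in the example are hypothetical rather than taken from the table.

```python
import math


def three_pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under a three-parameter logistic model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (guessing); D: scaling constant (assumed convention).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item with a = 1.0, b = 0.0, c = 0.2.
p_at_difficulty = three_pl(0.0, a=1.0, b=0.0, c=0.2)   # halfway between c and 1
p_low_ability = three_pl(-10.0, a=1.0, b=0.0, c=0.2)   # approaches the guessing level c
p_high_ability = three_pl(10.0, a=1.0, b=0.0, c=0.2)   # approaches 1
print(p_at_difficulty, p_low_ability, p_high_ability)
```

At theta equal to b the probability is midway between c and 1, so a nonzero c raises the floor of the curve for low-ability examinees.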



Table 3

Results of unidimensional simulation study: Rejection rates for testing
the null hypothesis of d_E = 1 over 100 trials with c = .20 and α = .05

J        SATV*    SATV        ACTM    ACTE    ASVAB AS    ASVAB AR
                  high dis

750      6        8           5       6       2           3
2000     6        7           4       4       2           1




SATV and ACTE each contain more than 50 items in the pool, but 50 items 
were randomly selected for the study. After each 10 of 100 trials a new 
sample of 50 items was chosen. For other tests the same test was used for 
all 100 trials. 






Table 4

Comparison of unidimensional simulation study results of this paper with
those in Stout (1987): Rejection rates for testing the null hypothesis of
d_E = 1 over 100 trials with c = .2 and α = .05

                 SATV         ACTM         ACTE         ASVAB AS     ASVAB AR
Study            750  2000    750  2000    750  2000    750  2000    750  2000

Stout* (1987)    2    6       1    4       3    1       1    1       2    4
Present          6    6       5    4       6    4       2    2       3    1

* For all tests the rejection rate reported is the average of rejection
rates (rounded to nearest integer) for the two different M values
reported in Table 2 of Stout (1987).






Table 5

Results of two-dimensional simulation study: Rejection rates for testing
the null hypothesis of d_E = 1 over 100 trials with c = .20 and α = .05

           SATV        ACTM        ACTE        ASVAB AS    ASVAB AR    ACTM24B    ACTM8B
           17-17-16    13-13-14    17-17-16    8-8-9       10-10-10    0-0-40     0-0-50

J          750  2000   750  2000   750  2000   750  2000   750  2000   2000       2000

ρ = .5     93   100    97   100    81   100    73   99     94   98     99         100
ρ = .7     58   96     66   97     37   83     50   83     61   91     69         98







Table 6

Comparison of two-dimensional simulation study results of this paper with
those in Stout (1987): Rejection rates over 100 trials for testing the
null hypothesis d_E = 1 with c = .2 and α = .05

                          SATV        ACTM        ACTE        ASVAB AS    ASVAB AR
                          17-17-16    13-13-14    17-17-16    8-8-9       10-10-10

         J                750  2000   750  2000   750  2000   750  2000   750  2000

ρ = .5   Stout* (1987)    62   98     69          59   90          87     76   -
         Present          93   100    97   100    81   100    73   99     94   98

ρ = .7   Stout* (1987)    36   83          74          55          54     67
         Present          58   96     66   97     37   83     50   83     61   91

* For all tests the rejection rate reported is the average of rejection
rates (rounded to nearest integer) for the two different M values reported
in Table 6 of Stout (1987).



Table 7

Results of real data study: Rejection rates for testing
the null hypothesis of d_E = 1 over 100 replications of random
selection of subjects with α = .05

                    AS10    AR12    F29B    F29C

M:                  30      30      40      40
J:                  1984    1961    2491    2494
Rejection rate      6       13      86      82










